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Abstract —Many applications collect a large number of time 
series, for example, the financial data of companies quoted in 
a stock exchange, the health care data of all patients that visit 
the emergency room of a hospital, or the temperature sequences 
continuously measured by weather stations across the US. These 
data are often referred to as unstructured. A first task in its 
analytics is to derive a low dimensional representation, a graph 
or discrete manifold, that describes well the /wtcrrelations among 
the time series and their /ntmrelations across time. This paper 
presents a computationally tractable algorithm for estimating 
this graph that structures the data. The resulting graph is 
directed and weighted, possibly capturing causal relations, not 
just reciprocal correlations as in many existing approaches in the 
literature. A convergence analysis is carried out. The algorithm 
is demonstrated on random graph datasets and real network 
time series datasets, and its performance is compared to that 
of related methods. The adjacency matrices estimated with the 
new method are close to the true graph in the simulated data 
and consistent with prior physical knowledge in the real dataset 
tested. 

Keywords: Graph Signal Processing, Graph Structure, 
Adjacency Matrix, Network, Time Series, Big Data, Causal 

1. Introduction 

There is an explosion of data, generated, measured, and 
stored at very fast rates in many disciplines, from finance and 
social media to geology and biology. Much of this data takes 
the form of simultaneous, long running time series. Examples 
include protein-to-protein interactions in organisms, patient 
records in health care, customer consumptions in (power, 
water, natural gas) utility companies, cell phone usage for 
wireless service providers, companies’ financial data, and 
social interactions among individuals in a population. The 
internet-of-things (loT) is an imminent source of ever increas¬ 
ing large collections of time series. This data is often referred 
to as i/^structured. 

Networks or graphs are becoming prevalent as models to 
describe the relationships among the nodes and the data they 
produce. These low-dimensional graph data representations are 
used for further analytics, for example, to compute statistics, 
make inferences, perform signal processing tasks [1], [2], [3], 
or quantify how topology influences diffusion in networks of 
nodes [4]. These methods all use the structure of the known 
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graphs to extract knowledge and meaning from the observed 
data supported on the graphs. 

However, in many problems the graph structure itself may 
be unknown, and a first issue is to infer from the data 
the unknown relations between the nodes. The diversity and 
w^structured nature of the data challenges our ability to derive 
models from first principles; alternatively, because data is 
abundant, it is of great significance to develop methodologies 
that, in collaboration with domain experts, assist extracting 
low-dimensional representations that structure the data. Early 
work in estimating low-dimensional graph-like structure in¬ 
cludes dimensionality reduction approaches such as [5], [6]. 
These methods work on a static snapshot of data and do not 
incorporate the notion of time. 

This paper focuses on estimating the network structure 
capturing the dependencies among time series in the form 
of a possibly directed, weighted adjacency matrix A. Current 
work on estimating network structure largely associates it with 
assuming that the process supported by the graph is Markov 
along the edges [7], [8] or aims to recover causality [9] in 
the sense popularized by Granger [10]. Our work instead 
associates the graph with causal network effects, drawing 
inspiration from the Discrete Signal Processing on Graphs 
(DSPg) framework [3], [11]. 

We provide a brief overview of the concepts and notations 
underlying DSPq and then introduce related prior work in 
section II and our new network process in section III. Next, 
we present algorithms to infer the network structure from data 
generated by such processes in section IV. We provide analysis 
on the convergence of our algorithms and the performance 
of the models for prediction in section V. Einally, we show 
simulation results in section VI and conclude the paper in 
section VII. 

II. Relation to Prior Work 

We briefiy review DSPq and then describe previous models 
and methods used to estimate graph structure. Section II-A 
considers sparse Gaussian graphical model selection. Sec¬ 
tion II-B sparse vector autoregression, and Section II-C other 
graph signal processing approaches. 

DSPg review 

Consider a graph G = (V, A) where the vertex set V = 
and A is the weighted adjacency matrix of the 
graph. A graph signal is defined as a map, x : V ^ C, 

Xn, that we can write as x = (xq xi ... xn-i)^ G C^. 

We adopt the adjacency matrix A as a shift [11], and 
assuming shift invariance and that the minimal polynomial 
mA {x) of A equals its characteristic polynomial, graph filters 




2 


in DSPg are matrix polynomials of the form h(A) = hoi -h 
hiA -h ... -h hLf^A^^, where is the polynomial order. 

A. Sparse Graphical Model Selection 

Sparse inverse covariance estimation [12], [13], [8] com¬ 
bines the Markov property with the assumptions of Gaussian- 
ity and symmetry using Graphical Lasso [12]. Let the data 
matrix representing all the K observations is given, 

X= (x[0] x[l] ... ^[K-1]) (1) 

where x[i] ^ A/’(0,5]), {x[i]} are i.i.d., and an estimate for 
© = is desired. The regularized likelihood function is 
maximized, leading to 

0 = argmin tr (S©) — log |©| + A ||©||^, (2) 

© 

where S = is the sample covariance matrix and 

l|e||i = Eie»,l- 

Given time series data generated from a sparse graph 
process, the inverse covariance matrix © can actually reflect 
higher order effects and be signiflcantly less sparse than the 
graph underlying the process. For example, if a process is 
described by the dynamic equation with sparse state evolution 
matrix A and ||A|| < 1, 

x.[k] = Ax[/c — 1] + w[/c], 

where w[i] ^ A/’(0, Xw) is a random noise process that is 
generated independently from w[j] for all i j, then 

oo 

S = E [x[k]x[kf] = y] A*Sw (A^)* 

i=0 

^®= . 

Even though the process can be described by sparse matrix 
A, the true value of © represents powers of A and need not 
be as sparse as A. In addition, A may not be symmetric, i.e., 
the underlying graph may be directed, while © is symmetric, 
and the corresponding graph is undirected. 


B. Sparse Vector Autoregressive Estimation 

For time series data, instead of estimating the inverse 
covariance structure, sparse vector autoregressive (SVAR) es¬ 
timation [14], [15], [9] recovers matrix coefficients for multi¬ 
variate processes. This SVAR model assumes the time series 
at each node are conditionally independent from each other 
according to a Markov Random Field (MRF) with adjacency 
structure given by A' G {0,1}^^^. This A' is related to 
Granger causality [16], [17] as shown in [9]. This problem 
assumes given a data matrix of the form in equation (1) that 
is generated by the dynamic equation with sparse evolution 
matrices 

M 

x.[k] = ^ A^*^x[/c — i] A- w[/c], 

i=l 


where w[i] is a random noise process independent from w[ji], 
i ^ j, and A*^*^ all have the same sparse structure, AL = 0 ^ 
A,-^^ = 0 for all k. Then SVAR solves, 

f’j 


{A^®^} = argmin - 

{AW} 


M 

x[/c] — ^ A^'^^x.[k — i] 

i=l 
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(3) 


where A\f ... ^ and Waijh promotes 

sparsity of the (i, jf)-th entry of each A^*^ simultaneously and 
thus also sparsity of A'. This optimization can be solved using 
Group Lasso [12]. SVAR estimates multiple weighted graphs 
A*^*^ that can be used to estimate the sparsity structure A'. 

In contrast, our model is defined by a single weighted 
A, a potentially more parsimonious model than (3). The 
corresponding time Alter coefficients (introduced in section III) 
are modeled as graph Alters. Using this single adjacency matrix 
A and graph Alters to describe the process enables principled 
analysis using the toolbox provided by the DSPq framework. 


C. Graph Signal Processing using the Laplacian 

Frameworks for signal processing on graphs using the 
weighted (and/or possibly normalized) Laplacian matrix rather 
than the weighted adjacency matrix have been proposed [18], 
[19], [20]. For example, with a known graph Laplacian matrix, 
the polynomial coefficients for optimal graph Alters can be 
learned and used to describe and compress signals [21]. 

The symmetric Laplacian is positive semideflnite with all 
real nonnegative eigenvalues and a real orthonormal eigen¬ 
vector matrix. Most Laplacian-based Graph Signal Processing 
methods assume the Laplacian to be symmetric and implicitly 
take advantage of these properties in various ways. 

Recently, methods for estimating symmetric Laplacians 
have been proposed and came to our attention after the sub¬ 
mission of this paper [22], [23], [24]. These methods estimate 
Laplacians for independent graph signals with an interpretation 
similar to the (inverse) covariance matrix in section II-A, 
and do not take into account the temporal structure and 
dependencies of the data. In addition, these methods all depend 
implicitly on the symmetry of the Laplacian, which yields 
an orthonormal eigenbasis. Furthermore, they depend on the 
conic geometry of the space of symmetric positive semideflnite 
matrices, which allows the utilization of convex optimization. 

Symmetric Laplacians correspond to undirected graphs, 
having nonnegative real eigenvalues. This may be restrictive 
in applications, since it assumes symmetric relations among 
time series data. Asymmetric Laplacians are also now being 
studied, but they are restricted to having zero row (or column) 
sums, which is often undesirable. 

In contrast, the directed adjacency matrix A that we assume 
may have positive as well as negative weights on edges and can 
have complex eigenvalues. We add that knowing the adjacency 
matrix does allow us to compute the Laplacian. In this paper, 
we adopt the adjacency matrix as the basic building block. 
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III. Causal Graph Processes 

Consider Xn [k ], a discrete time series on node Vn in graph 
G = (V, A), where n indexes the nodes of the graph and k 
indexes the time samples. Let N be the total number of nodes 
and K be the total number of time samples, and 

x[fc]= {xo[k] Xi[k] ... XN.i[k]f 

represents the graph signal at time sample k. 

We consider a Causal Graph Process (CGP) to be a discrete 
time series x.[k] on a graph G = (V, A) of the following form, 

M 

x.[k] =w[/c] + ^ ^i(A, c)x[/c — i] 

i=l 
M 

4 

=w[k] + (ciol + ciiA)x[/c - 1 ] 

+ (C 20 l + C 21 A + C 22 A^) x[/c - 2 ] + . . . 

+ (cmoI + ... + cmmA^) x.[k — M], 

where Pi (A, c) is a matrix polynomial in A, w[k] is statistical 
noise, Cij are scalar polynomial coefficients, and 

c= (cio Cii ... Cij ... Cmm)^ 

is a vector collecting all the q^’s. 

Note that the CGP model does not assume Markovianity in 
the nodes and edges of the graph adjacency matrix. Instead, the 
CGP is an autoregressive process (Markov) in the time series 
as in (4) whose coefficients Pi (A, c) are graph filters; thus, the 
CGP can incorporate the influence of many more (sometimes 
all) other nodes in a single step. Matrix polynomial Pi(A,c) 
is at most of order min(i, Na), refiecting that x.[k] cannot be 
infiuenced by more than i-th order network effects from i time 
steps ago and in addition is limited (mathematically) by Na, 
the degree of the minimum polynomial of A. Typically, we 
take the model order M <C Na, and for the remainder of the 
paper assume this holds for sake of notational clarity. 

This model captures the intuition that activity on the net¬ 
work travels at some fixed speed (one graph shift per sampling 
period), and thus the activity at the current time instant at a 
given network node cannot be affected by network effects of 
order higher than that speed allows. In this way, the CGP 
model can be seen as generalizing the spatial dimension of 
the light cone [25] to be on a discrete manifold rather than 
only on a lattice corresponding to uniformly sampled space. 

The current parameterization of the CGP model in (4) raises 
issues with identifiability. To address them, we assume that 
Pi(A,c) 7 ^ al for a G M. Then, without loss of generality, 
we can let cio = 0 and cn = 1 so that Pi(A,c) = A. To 
verify this, consider the full parameterization using (A',c'). 
We show that we can instead use the reduced parameterization 
(A, c) with Pi (A, c) = A to represent the same process. First, 
we start with 

Pi (A', c') = c'lol + c'liA' = A = Pi (A, c) 

^A' = (c;i)-i(A-c'ioI). 


=w[A:] + y] 


x.[k — i] 


j=0 


(4) 


We can invert cn, since by assumption c'^ / 0. Then consider 
the i-th polynomial, 

Pi(A',c') = ^4(A'7 = - 4oU 

i=o j=o 

j=0 k=0 ^ ^ 

= EE4Ki)“^' (i) (6) 

k=0j=k ^ ^ 

i 

= y]cifcA'' = P,(A,c), 

/c=0 

when we define 

Gk = ^cC(c'^^) ^ (~^ioy 

j=k ^ ' 

In the remainder of this paper, we assume that Pi (A, c) 7 ^ al 
and use the reduced parameterization with cio = 0 and cn = 1 
so that c G where n = {M — 1)(M + 4)/2 to ensure that 
A and c are uniquely specified without ambiguity. 


IV. Estimating Adjacency Matrices 

Given a time series x(t) on graph G = (V,A) with 
unknown A, we wish to estimate the adjacency matrix A. We 
assume the data follows the CGP model (4). A first approach to 
its estimation can be formulated as the following optimization 
problem. 


K-l 


2 ^ 

(A, c) = argmin - 

^ k=M 


M 


“ E c)x[fc - i] 


i=l 


2 


2 


+ Al ||vec(A) 111 + A 2 ||c||i, 

(7) 

where vec(A) stacks the columns of the matrix A. 

In equation (7), the first term in the right hand side models 
x[/c] by the CGP model (4) in section III, the regularizing 
term Ai ||vec(A) ||i promotes sparsity of the estimated adja¬ 
cency matrix, and the term A 2 ||c||i promotes sparsity in the 
matrix polynomial coefficients. Regularizing c corresponds to 
performing autoregressive model order selection. If the true 
model has Pi (A, c) = 0 Vj < i < M for some 0 < j < M, 
then regularization of c encourages the corresponding values 
to be 0. The matrix polynomial in the first term makes this 
problem nonconvex. That is, using a convex optimization 
based approach to solve (7) directly may result in finding a 
solution (A, c), minimizing the objective function locally, that 
is not near the true globally minimizing (A, c). 

Instead, we break this estimation down into three separate, 
more tractable steps: 


1) Solve for = P^(A,c) 

2) Recover the structure of A 

3) Estimate Cij 
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A. Solving for Pi (A, c) 

As previously stated, the graph filters Pi(A,c) are poly¬ 
nomials of A and are thus shift-invariant and must mutually 
commute. Then their commutator 

[P,(A,c),P,(A,c)] = 

Pi(A,c)Pj(A,c) - Pj(A,c)Pi(A,c) = 0 Vi, j. 

Let Hi = Pi(A,c), R = (Ri,...,Rm), and R^ be the 
estimate of Pi(A,c). This leads to the optimization problem, 

^ K-l M ^ 

R =argmin - ^ x.[k] — ^ Rix[/c — i] 

^ ^ k=M i=l 2 ( 8 ) 

+Ai||vec(Ri)||, + A3^||[Ri,R,-]fp. 

While this is still a non-convex problem, it is multi-convex. 
That is, when R_i = {Hj j i} (all Rj except for R^) are 
held constant, the optimization is convex in R^. This naturally 
leads to block coordinate descent as a solution, 

2 

^ K-l M 

Ri = argmin - x.[k] - Rjx[fc - j] 

^ k=M j=i 2 (9) 

+Ai||vec(Ri)||i + A3y]||[Ri,R,]||^ 

Each of these sub-problems for estimating R^ in a single 
sweep of estimating R is formulated as an -regularized least- 
squares problem that can be solved using standard gradient- 
based methods [26]. We compute a rough estimate of the 
computational cost for solving (9). Assume that finding the 
gradient is the most costly step in the optimization, and naively 
estimate matrix-matrix products to take 0{N^) operations; 
then the optimization has worst-case complexity 0{M‘^N^ + 
KMN‘^) incurred from minimizing over M separate blocks. 
The cost of the Pth problem is dominated in each iteration by 
K—M matrix-vector products Rix[/c—i] with worst case total 
complexity 0{{K — M)N‘^) and by matrix-matrix products 
R^Rj for j i with worst case complexity 0{{M — 

We now improve these cost estimates by noting that, due to the 
sparsity in the R^ matrices, the factor of 0{MN^) resulting 
from matrix multiplications can be reduced to 0{MS‘j^) when 
implemented using sparse matrix multiplications, where S'at is 
the sparsity of the R^ (i.e., max^ ||Ri||o = Sn)- In addition, 
the sparse matrix-vector products can also be reduced to 
0{MSn)- Then, the total complexity is better estimated to 
be 0{MS‘j^), which scales more amenably for large data 
applications. 

B. Recovering A 

After obtaining estimates R^, we find an estimate for A. 
One approach is to take A = Ri. This appears to ignore the 
information from the remaining R^. However, this information 
is taken into account during the iterations when solving for R, 
especially if we begin one new sweep to estimate Ri using (9) 


with i = 1. A second approach is also possible, explicitly using 
all the Hi together to find A, 

A = argmin Ri — A -f Ai ||vec(A)||i 

A 2 

M 

-h As^ I [A,Ri^ 

This can be seen as similar to running one additional step 
further in the block coordinate descent to find Ri, except 
that this approach does not explicitly use the data. This has 
worst-case complexity 0{MN^) dominated by matrix-matrix 
products AR^ for i 1. However, when matrices A and R^ 
are sparse, we again have reduced complexity of 0{MS‘^). 

C. Estimating c 

We can estimate Cij in one of two ways: we can estimate 
c either from A and R^ or from A and the data X. 

To estimate Cij from A and R, we set up the optimization, 

1 \ 2 

Ci = argmin- vec Ri) - Q^Ci +A 2 ||ci||i, (11) 

Ci 2 \ J 2 

where 

= ^vec(I) vec(A) ... vec(A*)^ , 

Ci\ . . . Cii^ 

Alternatively, to estimate Cij from A and the data X, we 
can use the optimization, 

c = argmin i Y(A)-B(A)c +A 2 ||c||i (12) 

c ^ F 

where Y(A) = vec ^Xm — AXm-i^, 

B(A)= (vec(XM- 2 ) ... vec (a^Xm-z) ... vec (a^Xq)), 

and X^ = ( x[m] x[m + 1] ... x[m A K — M — 1] ) , 

where in B(A) the i and j indices increment in 
correspondence to the indexing of Cij in c. That is, 
first set i = 2; then j is incremented from 0, and 

i is incremented, and this repeats until (i,j) = (M, M). 
Then (12) can also be solved using standard -regularized 
least squares methods with worst-case complexity O(M^) 
per iteration, dominated by computing (dense) matrix-vector 
products with B^B G and B^Y G BP, with 

n = (M — 1)(M + 4)/2 = While this may seem 

daunting, we note that M, the lag order of the autoregressive 
process, is usually much lower than the total number of time 
samples K, so that the optimization to find c is unlikely to 
be the bottleneck in the overall model estimation. 

D. Base Estimation Algorithm 

The methods discussed so far can be interpreted as assuming 
that 1) the process is a linear autoregressive process driven by 
white Gaussian noise and 2) the elements in parameters A 
and c a priori follow zero-mean Laplace distributions. Under 
these assumptions, the objective function in (7) approximately 
corresponds to the log posterior density and its optimization 
to an approximate maximum a posteriori (MAP) estimate. 
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This framework can be extended to estimate more general 
autoregressive processes, such as those with a non-Gaussian 
noise model and certain forms of nonlinear dependence of the 
current state on past values of the state. In this case, we can 
formulate the general optimization as 

M 

(A,c) = argmin/i (vec(XM), vec( ^Pi(A, c)Xm-^)) 

A,c ^ ^ 

+ gi(A) + 92 ( 0 ), 

where matrices X^ are defined under (12), /i(', •) is a loss 
function that corresponds to a log-likelihood function dictated 
by the noise model, and ^i(') and ^ 2 (') are regularization 
functions (usually convex norms) that correspond to log- 
prior distributions imposed on the parameters and are dictated 
by modeling assumptions. Again, the matrix polynomials 
Pi(A,c) introduce nonconvexity, so similarly as before, we 
can separate the estimation into three steps to reduce complex¬ 
ity. This leads to analogous formulations for the optimization 
problems (7)-(12) that we omit for brevity and clarity. In the 
remainder of this paper, we refer to the specific formulations 
given in (7)-(12). 

Algorithm 1 summarizes the 3-step algorithm outlined in 
this section for obtaining estimates A and c for the adjacency 
matrix and filter coefficients; it is a more efficient and well- 
behaved alternative to directly using (13). 


Algorithm 1 Base estimation algorithm 

1: Initialize, t = 0, = 0 

2: while not converged do 
3: fori = lj^Mdo ^ ^ 

4: Find R^^^^ with fixed R<), R>7^^ using (9). 

5: end for 

6: t i — t -j- 1 

7: end whil^ ^ ^ 

8: Set A = R^^^ or estimate A from R*^^^ using (10). 

9: Solve for c from X, A using (11) or from X, R 
using (12). 


We call this 3-step procedure the base algorithm. In Algo¬ 
rithm 1, superscripts denote the iteration number, R^- denotes 
j <i} and likewise R^- denotes {Rj^^ : j > 0- 

E. Simplified Estimation Algorithm 

The base algorithm can still be moderately expensive to 
evaluate computationally when scaling to larger problems 
and is difficult to analyze theoretically, mainly due to the 
nonconvexity of the commutativity-enforcing term. For further 
ease of computation and analysis, we consider a simplified 
version of the base algorithm in which the commutativity term 
of the optimization problem (9) is removed, 

M 

R = argmin fi ^vec(XM), 

+ 

i=l 


Note that we have recombined the optimization to be joint 
over all the R^, since this is convex without the commutativity 
term. This can be followed by the san^ ste^ to find A 
and c as in the base algorithm, taking A = Ri or as the 
solution to (10), and then solving (11) or (12) for c. We 
call this the Simplified Algorithm, described in Algorithm 2. 
Using standard solvers for the estimation in (14) with £2 
and £i norms as seen before [26], the worst-case cost is 
0{M{K—M)N^), dominated by computing the matrix-vector 
product Rix[/c — i] for i = 1,..., M and for k = M,..., A. 
Again, by implementing sparse matrix-vector products, we can 
reduce the complexity to 0{M{K—M)Smn) in the best case. 

Algorithm 2 Simplified estimation algorithm 

1: Initialize R = 0 
2: Estimate R using (14) 

3: Set A = Ri or estimate A from R using (10). 

4: Solve for c from X, A using (11) or from X, R 
using (12). 


E Extended Estimation Algorithm 

As an extension of the base algorithm, we can also choose 
the estimated matrix A and filter coefficients c to initialize 
the direct approach of using (13). Starting from these initial 
estimates, we may find better local minima than with initializa¬ 
tions at A = 0 and c = 0 or at random estimates. We call this 
procedure the extended algorithm, summarized in Algorithm 3. 

Algorithm 3 Extended estimation algorithm 

1: Initialize A^^^ = 0 and = 0. 

2: Estimate A*^^\ using basic algorithm. 

3: Eind local minimum A, c using initialization 

from nonconvex problem (13) using convex methods to 
find a local optimum. 


V. Convergence oe Estimation 

In this section, we discuss the convergence of the basic, 
extended, and simplified algorithms described above. Con¬ 
vergence of the Base Algorithm, Algorithm 1, and of the 
Extended Algorithm, Algorithm 3 follow by direct application 
of the results available in literature, as discussed in Sec¬ 
tions V-A and V-B. Performance of the Simplified Algorithm, 
Algorithm 2, is proven in Section V-C. 

A. Base Estimation Algorithm 

In estimating A and c, as illustrated in equations (10) 
and (12), the forms of the optimization problems are well 
studied when choosing £2 and £i norms as loss and reg¬ 
ularization functions [27], [28]. However, using these same 
norms, step 2 (the while loop) of the Base Algorithm is a 
nonconvex optimization. Hence, we would like to ensure that 
step 2 converges. 

When using block coordinate descent for general functions 
(i.e., repeatedly choosing one block of coordinate directions 
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in which to optimize while holding all other blocks constant), 
neither the solution nor the objective function values are 
guaranteed to converge to a global or even a local minimum. 
However, borrowing results from [29], under some mild as¬ 
sumptions, using block coordinate descent to estimate will 
converge. In fact, in equation (13), if we assume the objective 
function to be continuous and to have convex, compact sub- 
level sets in each coordinate block (for example, if the 
functions for /i, gi, and g 2 are the £ 2 , £ 1 , and £i norms 
respectively as in equation (7)), then the block coordinate 
descent will converge. 

B. Extended Estimation Algorithm 

Now we discuss the convergence of the extended estimation 
algorithm described in Algorithm 3 of section IV-F assuming 
that in step 1 of Algorithm 3, the Base Algorithm has 
converged to an initial point (A^^\c*^^^). We assume that 
the objective function F = /i + + ^2 in equation (13) 
has compact sublevel sets and is bounded below. Then an 
iterative (sub-)gradient method with appropriately chosen step 
sizes that produces updates of such that 

^(A(^+i)^< F(A^^\may converge to a local 
optimum (which is not necessarily the global optimum). If the 
functions /i, gi, and ^2 are the £ 2 , £ 1 , and £i norms as in 
equation (7), these conditions are satisfied and the algorithm 
converges to a local optimum. 


C. Simplified Estimation Algorithm 

Here, we consider the convergence of the Simplified Al¬ 
gorithm, Algorithm 2. We state in this section formally the 
underlying assumptions and the results; we also outline the 
main arguments underlying the proofs. The proofs themselves 
are detailed in Appendices A-C. We consider the theoretical 
performance guarantees provided by Algorithm 2 of the sim¬ 
plified estimate (A, c) when using £i regularized least squares 
to estimate R, 


K-l 


^ .I V- 

R = argmm — > 


k=M 


M 


x.[k] — Rix[/c — ' 


i=l 


(15) 


+ Ai||vec(R) 


which is a special case of (14), where cfu = ||E[w/cW^] ||, 
and then using A = Ri and the optimization (12) to find 
c. Dividing by cfu makes the estimation unitless, although in 
practice we might incorporate its value into the Ai parameter. 

Our error metric of interest will be: 


e =E 


_N 
-E 


1 II ^ 

x[fc] -/(A,c,Xfe_i) 


^||x[fc]-/(A,c,XU)"' 


(16) 


where 

X'fc_i = ( x[fc-l]^ x[fc-2]^ ... x[fc-M]^ 

M 

and /(A, c, X'^_^) = ^ Pi{A, c)x[/c — i]. This error e is the 
2 = 1 

average excess prediction risk, the difference between the error 
of estimating x[k] by the estimated CGP (first term in (16)) 


and the variance of the noise w[k] in the CGP given in (4) 
(second term in (16)). The expectations in (16) are taken over 
a new sample 

Zfc = (x[fc]T e (17) 

drawn independently of the samples used to estimate ^A, . 

We now state the assumptions underlying the main results. 

1) Assumptions: First, we list the assumptions we make 
about the true process in order to derive our performance 
guarantees: 

(Al) The CGP model class (4) is accurate: 

E[x[fc]|X;,_i]=/(A,c,X',_i). 

(A2) The noise process is uncorrelated with the CGP and its 
own past values: 

E [x[j]w[/c]~^] = 0 and E [w[j]w[/c]^] = 0 for j < k. 

Further, the noise sequence {w[/c]} is i.i.d. multivariate 
Gaussian with distribution w[k] ^ A/’(0, Xw), and there 
exists a£ such that 0 < < ||5]“^||2 , the singular 

values of are strictly bounded away from 0. Then 
we can represent the condition number of as Oujerg, 
where is given in (15). 

(A3) The CGP is stationary and is already in steady state when 
we begin sampling. Under this assumption, the marginal 
distributions and expectations are E[x[/c]] = 0, I]o = 
E [x[/c] x[/c]'^], and 5] = E [z/^z^] where z/^ is defined 
as in (17). 

(A4) The stationary correlation matrices of the process x[/c] 
are absolutely summable in induced norm: 

CX) 

E ||E[x[fc]x[A:-z ]^]||2 =G<oo. 

i— — oo 

This is a slightly stronger condition than stability (see 
proof of Lemma 2 in appendix A). 

(A5) The true adjacency matrix and filter coefficients are sparse 
and bounded: 

IK A P 2 (A,c) ... Pm(A,c) )|K<5Miv«:MA2 

and maxi<i<M(||A*|| 2 ) < P, maxi<i<M(||A*||o) < 
^ and ||c||o < sm and ||c||i < p, where the 
matrix “norm” ||A||o is the total number of nonzero 
entries in matrix A. The quantities Smn and sm niay 
grow with M and N, and niay grow with N. 

(A6) The following holds for the bounds L and p in (A5): 
(1 + P)(l + p) = Q < 2. This is also a slightly 
stronger condition than stability (see proof of Lemma 2 
in Appendix A). 

(A7) The sample size K is large enough relative to the “stabil¬ 
ity” of the process, the log of the network size, and the 
sparsity. Also the autoregressive model order M is low 
relative to the length of the sample size K\ 

T = K-M> CiJ Smn (logM + log A) 

M<o{logK) 

o ^ 

for some universal constant C > 0 and uo = ^ , 

where uj and /imin(Al) are related to measures of “sta¬ 
bility” of the process [28], and their explicit forms are 
given in appendix A in equations (20) and (22), and Q 
is given in (A6). 
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(A 8 ) The matrix E[B(A)^B(A)] is invertible, or alterna¬ 
tively, its minimum singular value is strictly positive: 

(E[B(A)’rB(A)])“^ ” >kbNT>0. 

We point out that assumptions (A1)-(A3), (A7), and (A 8 ) 
correspond to fairly standard assumptions in stationary time 
series analysis and studying performance of parameter es¬ 
timation. (A7) assumes enough samples to do meaningful 
estimation, which is standard in high-dimensional estimation. 
In this assumption, the minimum number of samples is linked 
to the properties of the process. Assumption (A 8 ) makes sure 
that c is identifiable when given the true value of A. 

As noted (A4) and (A 6 ) are slightly stronger than “stability.” 
This is because these assumptions are not necessarily implied 
by stationarity. We note that (A5) is perhaps the most restric¬ 
tive assumption, as it imposes explicit sparsity conditions on 
the polynomials P^(A,c) and vector c. In network science 
terms, this assumption roughly corresponds to graphs with 
relatively longer node-to-node paths, so that higher order 
powers of A are not all dense. However, this assumption 
has the same fiavor as the explicit sparsity assumptions made 
in other sparse estimation work. It would be interesting to 
relax this assumption to recent notions of approximate sparsity 
rather than exact sparsity, and that could be the direction of 
future work. 

2) Theoretical Performance: Here, we present the main 
guarantee and a brief sketch for its proof, providing several 
lemmas of intermediate results used in the main result. The 
full proof is presented in Appendices A-C. 

Theorem 1 (Main Result). For any {) < P < a < 1/2, 
and some universal constant di, assumptions (A1)-(A8) are 
sufficient for the error e in (16) to satisfy 


Before we outline the proof of the theorem, we first present 
two intermediate results. These two results use sparse vector 
autoregression estimation results to show that ||A — A ||2 and 
||c — c||i are small with high probability. Then, with small 
errors in estimating A and c, we can demonstrate that the 
error e is also small. 

Lemma 2. Assume (Al) that x[/c] is generated according 
to the CGP model with A satisfying (AS). Suppose (A7) 
that the sample size K is large enough. Let di > D 
with i = 1,...,3 be universal constants, and let g{Q) — 
ds (^1 + ( 2 -q )2 ) Imnk = a/ (log M + log N)/K. With 

sa = exp{—(i 2 Ar/cc^}, Ai = 4:g{Q)iMNK, cind R = 
(Ri,... ,Rm) the solution to (15), with probability at least 
1 — sa, a = R 4 satisfies the inequalities 

||A — A||i < 256Smn^mnkQ‘^ g{Q)cru/cri 

||A — AII 2 < ||A — A\\f < = 64V Smn^mnkQ‘^ g{Q)o'u/o-i 

q2 

where uj = ——as defined in (A7). 

This lemma states that, with the appropriate choice of 
Al, the assumptions (A1)-(A8) are sufficient to allow good 
estimation of A with high probability. That is, for each 
increase of sample size K, we choose the optimal value of the 
regularization parameter Ai, and this choice yields consistency 
with high probability. 

Lemma 3. Suppose that assumptions (A1)-(A8) hold. Then 
for any D < ^ < v < 1/2, and sparsity sm < d 4 ^K/\ogn 
where ^4 > 0 is a universal constant, and A 2 > = 

Q{Nsolution to (12) satisfies 

P (||c — c||i > ^c) ^ 

where 


e < (-5 a (l + {p + Sa)LM{SA)) + (1 + L)5c)%r(So)/iV (18) 


with probability at least 1 — sa — where 


Ljviid) = max 
l<i<M 


{L + sy - U 


Sa ~ di exp{-0(A)} 

£c ~ 2 exp{- 0 (Ai- 2 '^)} 

5a = 0 fj\ogN/K) 

Sc = 0 (filog N/K^(^-y . 


The exact expressions for ca, ^c, and can be found 
in the appendices. This theorem states that, as the number 
of nodes N and the number of time observations K grow, 
with high probability, the average excess prediction risk of 
the simplified estimate is not too large. The dependence of 6 
on K and L are through T = K — M and Q, respectively, 
where L is defined in (A5), and Q is defined in (A 6 ). The AR 
order M and network size N may also grow, as long as they 
grow slowly enough with respect to K. Note that, for large K, 
we have 5 L, and the factor Lm{S) = The 

full proof of Theorem 1 is in Appendix C, but we provide a 
brief overview here. 


= o ^^^JlogN/K^(^-d)j 
Sc ^ 2 exp{- 0 (Ar^“^'")}. 

In a similar vein as Lemma 2, this lemma states that, 
for appropriate choice of A 2 , the assumptions (A1)-(A8) are 
sufficient to yield a good estimate of c with high probability. 

Proof Overview for Theorem 1. Lemma 2 shows that we can 
achieve good performance in estimating A with high proba¬ 
bility. With considerable effort and several additional insights, 
we show that good performance in estimating A, translates 
into good estimation of c with high probability in Lemma 3. 

Then, small errors ||A — A ||2 < ^a and ||c — c||i < 
naturally lead to small prediction errors e. Concretely, we 
proceed with some algebra, showing that 

'’“s'" [||/(A,c,Xi._.)-/(A,c,Xt_.)||:_ 

< [(||/(A.c.Xl_,)-/{A,c,Xi _,)||2 

+ ||/(A,c,x;_i)-/(A,c,Xi,_i)|| 2 )' . 

Given small estimation errors, we can bound these two norms 
separately. The first can be bounded using both Lemmas 2 
and 3, and the second can be bounded with Lemma 3. 




Applying the union bound, we arrive at our result. Again, the 
full detailed proof is in the appendices A-C. □ 

VI. Experiments 

We test our algorithms on two types of datasets, a real 
temperature sensor network time series (with N = 150 and 
K = 365) and a synthetically generated time series (with 
varying N and K). With the temperature dataset, we compare 
the performance of the different algorithms (1, 2, and 3) for 
estimating the CGP model (4) against that of the sparse vector 
autoregresssive Markov random field model (3) labeled as 
MRF, where /i(x,y) = 5 l|x-y||| and gi(x) = Ai||x||i; 
with the synthetic dataset, we compare the estimates of the 
model parameters to the ground truth graph used to generate 
the data for Algorithm 2, and we evaluate the performance of 
the prediction by averaging through Monte Carlo simulations 
to empirically test Theorem 1. 

To solve the regularized least squares iterations for esti¬ 
mating the CGP matrices, we used a fast implementation 
of proximal quasi-Newton optimization for ii regularized 
problems [30]. To estimate the MRF matrices, we used an 
accelerated proximal gradient descent algorithm [31] to esti¬ 
mate the SVAR matrix coefficients from equation (3) since the 
code used in [9] is not tested for large graphs. 

A. Temperature Data 

The temperature dataset is a collection of daily average 
temperature measurements taken over 365 days at 150 cities 
around the continental United States [32]. The time series 
= (x^O] ... Xi[K-l]), K = 365, i = 0,..., 149, is 
detrended by a 4th order polynomial at each measurement 
station i to form x^. The data matrix X is formed from 
stacking the detrended data x^. 

We compare the prediction errors of assuming the detrended 
data x^ are generated by the CGP or the MRF models, or by 
an undirected distance graph as described in [3]. The distance 
or geometric graph model uses an adjacency matrix to 
model the process 

M 

x[/c] = w[k] -f /ii(A^^^^)x[/c - i], 

i=l 

Lh 

where ^ are polynomials of order Lh 

j=0 

of the distance matrix with element^ chosen as 

p ^mn 

A dist _ _^_ 

^mn ~ ! 

VEjgaa- e e“ 

with representing the neighborhood of the a cities nearest 
to city n and dmn being the geographical Euclidean distance 
between cities m and n. In this model, the number of time lags 
M is taken to be fixed, and the polynomial coefficients Cji are 
to be estimated. In our experiments, we assumed M = 2. 

We separated the data into two subsequences, one consisting 
of the even time indices and one consisting of the odd time 
indices. One set was used as training data and the other set 
was left as testing data. This way, the test and training data 
were both generated from almost the same process but with 
different daily fiuctuations. In this experiment, we compute 
the prediction MSE as 




(b) Testing Errors vs Nonzero Parameters 


Eig. 1. Compression and Prediction Error vs Nonzeros using order M = 2 
model 

^ ^ i=M+l 

Here, since we do not have the ground truth graph for the 
temperature data, the experiments can be seen as correspond¬ 
ing to the task of prediction. The test error indicates how 
well the estimated graph and corresponding estimated model 
can predict the data at the following time instance from past 
observations. The models are estimated using Algorithms 1, 
2, and 3 with /i(x,y) = |||x - y||| and gi{x.) = Ai||x||i. 
We repeated the training step for all values of Ai, A 2 , A 3 on a 
grid discretizing the interval (0,500]. We used the values of 
the polynomial coefficient regularization parameter A 2 and the 
commutativity regularization parameter A 3 corresponding to 
the lowest training error to estimate the model on test data and 
determine the test error; the sparsity regularization parameter 
Ai directly affects the nonzero proportion and number of 
nonzero parameters, as expected. 

Figure 1(a) shows the performance of the basic, extended, 
and simple Algorithms 1, 2, and 3, as well as the distance 
graph and MRF models, as a function of edge sparsity of 
the respective graphs. We see that directed graphs estimated 
from data (either CGP or MRF) perform better in testing than 
the undirected distance graphs that are derived independently 
of the data. In addition, for high^parsity or low number of 
nonzero entries in the estimated A, (pnnz = <0-3, 

where Pnnz is the proportion of nonzero edges in tne graph 
and Annz is the number of nonzero edges in the graph, 
not including self-edges), the performance of the CGP is 
competitive with the MRF model. At lower sparsity levels 
(Pnnz > 0.3), i.e., denser graphs, the MRF model performs 
better than the CGP model in minimizing training error (not 
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Fig. 2. Estimated CGP temperature graph using order M = 2 model with 
sparsity pnnz = 0.05 


shown), but not in testing. 

Figure 1(b) shows the prediction performance of the same 
algorithms as in 1 (a) as a function of the total number of 
nonzero parameters of the respective models. The same trends 
are present here as in the discussion above. Here, for the 
CGP model, the number of nonzero parameters is calculated 

where N includes the 
diagonal entries and < M(M + 1) — 3 counts the 

nonzeros in c. For the MRF model, the number of nonzero 
parameters is calculated as ^N), where 

N includes the diagonal entries and the factor of M accounts 
for the fact that the nonzero entries of can be different 
across i. We can see that for the same level of sparsity, 
the MRF model has more nonzero parameters than the CGP 
model by approximately a factor of M. Here, the CGP has 
lower test error than MRF with fewer nonzero parameters 
(Nr^nz < 2000 ). 

In figure 2, we visualize the temperature network estimated 
on the entire time series using the CGP model that has sparsity 
level Pnnz = 0.05. The x-axis corresponds to longitude while 
the p-axis corresponds to latitude. We see that the CGP model 
clearly picks out the predominant west-to-east direction of 
wind in the X > —95 portion of the country, as single points 
in this region are seen to predict multiple eastward points. It 
also shows the infiuence of the roughly north-northwest-to- 
south-southeast Rocky Mountain chain at —110 < x < —100. 
This CGP graph paints a picture of US weather patterns that 
is consistent with geographic and meteorological features. 
This consistency may be the most pleasing and surprising 
observation of this experiment. This also helps intuitively 
explain why the distance graph, which is not derived from 
data, is not as good at predicting weather trends, since cities 
that are geometrically close may be geographically separated. 


B. Synthetic Data 

We test our simplified algorithm with /i(x, y) = ||x — y ||2 
and ^ 2 (x) = ||x||i on larger synthetic datasets to empirically 
verify the theory developed in section V. 

The random graphs corresponding to A were generated 
with 4 different topologies: K-regular (KR), Stochastic Block- 
Model (SBM) [33], Erdos-Renyi (ER) [34], and Power Law 
(PL). 


The KR graph was generated by taking a circle graph and 
connecting each node to itself using a weight of —1 and to 
its ^ = 3 neighbors (we use ^ for this quantity rather than K, 
since we have used K to denote a different quantity) to each 
side on the circle with weights drawn from a random uniform 
distribution Z//(0.5,1). This resulted in a (2z/ -1- l)-diagonal 
(in this case heptadiagonal) matrix. Finally the matrix was 
normalized by 1.5 times its largest eigenvalue. 

The SBM graph was generated by creating 10 clusters 
with each of the 1000 nodes having uniform probability of 
belonging to a cluster. Edges between nodes were generated 
according to assigned intra- and inter- cluster probabilities 
summarized by a 10 x 10 matrix (intra-cluster probabilities 
were on the diagonal, and inter-cluster probabilities on the 
non-diagonal entries). This matrix was generated randomly 
and sparsely. Starting with 0.051, we added an independent 
random quantity to each element, where the variables were 
uniformly distributed on [0,0.04). Then, entries below 0.025 
were thresholded to 0. The edges generated were assigned 
weights from a Laplacian distribution with rate Ag = 2. 
Finally, the matrix was normalized by 1.1 times its largest 
eigenvalue. 

The ER graph was generated by taking edges from a 
standard normal A/’( 0 , 1 ) distribution and then thresholding 
edges to be between 1.6 and 1.8 in absolute value to yield 
an effective probability of an edge per ~ 0.04. The edges 
were soft thresholded by 1.5 to be between 0.1 and 0.3 in 
magnitude. Finally, the matrix was normalized by 1.5 times 
its largest eigenvalue. 

The PL graph was generated by starting with a 15 node 
ER graph with connection probability 0.8. New nodes were 
connected by two new edges to and from an existing node of 
weight drawn as A/’( 0 , 1 ) and then offset 0.25 away from 0 . 
The connections were made according to a modified prefer¬ 
ential attachment scheme [35] in which the probability of the 
new node connecting to an existing node was proportional to 
the existing node’s total weighted degree. The diagonal was 
set to —1/2. Lastly, the matrix was normalized by 1.5 times 
its largest eigenvalue. 

Examples of these topologies with their weighted edges and 
representative estimates A can be seen in figure 3 (higher 
absolute weights are displayed in darker blue). Because of 
the large number of nodes, these topologies are difficult to 
visualize. So, for display purposes only, we chose parameters 
that lean towards fewer false alarms at the expense of having 
more missed edges—since more false alarms obstruct and 
obscure the layout. Eor KR, see figure 3(a), we used a circular 
layout in which the true edges are along the perimeter and 
not through the interior. This is replicated in the estimated 
graph^. Eor the KR graph whose results are in figure 3(a) 
(false alarm rate (Pea) 3 x 10 “^ and miss rate (Pm) 0.16), 
we used a circular layout (as previously mentioned); for SBM, 
see figure 3(b) (displayed plot has Pea = 7 x 10“^ and 
Pm = 0 . 22 ), and ER, see figure 3(c) (displayed plot with 
Pea = 4 X10 ^ and Pm = 0.58), we used force-directed edge 

Tn the estimated graph, there are some edges through the interior with 
much lower weights relative to the weights on the edges along the perimeter, 
which are not visible when visualized. 
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(a) True A (left) and Estimated A (right) for K-Regular graph 



(b) True A (left) and Estimated A (right) for Stochastic Block-Model graph 



(c) True A (left) and Estimated A (right) for Erdos-Renyi graph (d) True A (left) and Estimated A (right) for Power Law graph 

Fig. 3. True (left) and estimated (right) edge weights (absolute values) for K-Regular (top left), Stochastic Block-Model (top right) Erdds-Renyi (bottom left), 
and Power Law (bottom right) graphs for N=1000 


bundling [36] to group nearby edges into few thicker “strands” 
in addition to circular layouts; and for PL, see figure 3(d) 
(displayed plot with PpA = 4 x 10“^ and Pm = 0.69), 
we used a Fruchterman-Reingold [37] node positioning. If we 
lower the value of the sparsity regularization parameter Ai on 
our grid, we can reduce the miss rate, while increasing the 
false alarm rate. Just for reference, we provide another pair 
of corresponding PpA and Pm values: for KR, PpA = 0.01 
and Pm = 0.07; for SBM, PpA = 0.01 and Pm = 0.40; for 
ER, Pfa = 0.03 and Pm = 0.25; for PL, PpA = 0.03 and 
Pm =0.52. Playing with Ai we can get other tradeoffs more 
suitable to specific applications. Just from these experiments, 
it seems that the SBM and PL models are more difficult to 
estimate than the KR and ER models. This could be due to the 
average degree of SBM being high and the maximum degree of 
PL being high while the number of time observations K is held 
constant throughout these experiments, which would violate 
Assumption (A5) or (A7). A more thorough understanding of 
these tradeoffs, such as what additional conditions need to 
be satisfied for relevant guarantees to hold, and what broad 
classes of graphs satisfy these conditions, is left as a topic of 
future investigation. 

Once the A matrix was generated with N G {1000,1500} 
nodes, for fixed M = 3, the coefficients Cij for 2 < i < M 
and 0 < j < i were generated sparsely from a mixture of 
uniform distributions 2*+-^Qy ^ |Z7( —1, —0.45) +^Z7(0.45,1) 
and normalized by 1.5 to correspond to a stable system. 

The data matrices X were formed by generating random 
initial samples (with 500 bum in samples to reach steady state) 
and zero-mean unit-covariance additive white Gaussian noise 
w[k] computing K G {750,1000,1500, 2000} samples of x[/c] 
according to (4). Then 20 such Monte-Carlo data matrices 


were generated independently, and (A, c) was estimated using 
the simplified algorithm for varying values of Ai and p. The 
average error of the A matrix was computed as 

MSE=^\\A-m. 

The empirical value of the error metric e from (16) was also 
measured by generating another 20 independent sets of time 
series z/. (as in (17)) at steady state and using the estimates 
to predict x.[k] using 

In figure 4, we show the errors in A and the empirical 
estimates of e for different values of N and K. We performed 
the estimation across a grid of A 3 and p and choose the lowest 
observed error to be e. We see several different behaviors. We 
observe that the average errors in A and empirical averages 
for e for the KR graph decrease with both larger N and K. We 
observe similar behavior for the SBM graph. This suggests that 
graphs generated according to these topologies might satisfy 
the assumptions in section V. On the other hand, the error in A 
for ER does not display a clear trend in N, although observing 
more samples still improves the estimate of A and the model 
prediction performance as expected. For the PL model, the 
error in estimating A decreased with increasing N and K, 
but the error e fiuctuated around 0 with no clear trend. 

This varied behavior arises because the different structured 
and random graph topologies examined tend to exhibit certain 
network properties that do not all correspond directly to the 
assumptions in section V. In particular, the K-regular graph by 
its construction does satisfy the assumptions. This is because 
taking the n-th power of the adjacency matrix of a ^-regular 
graph of this form results in an (n^)-regular graph, which 
satisfies the sparsity (A5) with some constants that do not grow 
too fast with N and K. Thus, the results of the KR graph 
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(a) Error in A (left) and e (right) for KR graph 


(b) Error in A (left) and e (right) for SBM graph 




(c) Error in A (left) and e (right) for ER graph (d) Error in A (left) and e (right) for PL graph 


Eig. 4. Error in A (left) and e (right) for K-Regular (top left), Stochastic Block-Model (top right), Erdos-Renyi (bottom left), and Power Law (bottom right) 
graphs for N= 1000,1500 


empirically conform to the predictions given by the theory 
in section V. With the other topologies, the behavior of the 
sparsity constants is not as immediately clear, but the observed 
results suggest that some of the assumptions could be slightly 
loosened, or that other network statistics (e.g., diameter or 
maximum degree) could play a role in the performance. 

VIL Conclusions 

This paper presents a methodology to estimate the network 
structure (graph) capturing spatial (inter) and time (intra) 
dependencies among multiple time series. These data may 
arise in many different contexts. The data time dependencies 
are modeled by an auto-regressive (AR) process. The spatial 
dependencies are captured by describing the matrix coeffi¬ 
cients of the AR process as graph polynomial filters [11]. 
The paper presents three algorithms to estimate the graph 
adjacency matrix and parameters of the graph polynomial 
filters. These algorithms optimize cost functions that combine 
a model following functional, e.g., an £2 error metric, with 
sparsity regularizers, e.g., £i metric. The paper carries out 
the convergence and performance analysis of these algorithms 
under a set of appropriate assumptions. Finally, the paper 
illustrates the performance of these algorithms with a set of 
real data (temperature data collected by 150 weather stations 
covering the US) and with simulated data for four different 
types of networks. These experiments show the advantages 
and limitations of the approach. 
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Appendix A 
Proof of Lemma 2 

Lemma 2. Assume (Al) that x[/c] is generated according 
to the CGP model with A satisfying (A5). Suppose (A7) 
that the sample size K is large enough. Let di > ^ 
with i = be universal constants, and let g{Q) — 

d 3 (^1 + ^MNK = \/(log M + log N) / K. With 

sa = (ii exp{—(i 2 A/co’^}, Ai = ^g{Q)^MNK, cind R = 
(Ri,... ,Rm) the solution to (15), with probability at least 
1 — sa, a = Ri satisfies the inequalities 

IIA — A||i < 256SMNiMNKQ‘^giQ)cru/cri 

||A — A ||2 < ||A — A||f < = 64V Smn^mnkQ^ g(Q)cru/cr£ 

where uj = ^as defined in (A7). 

/Xmin(A) 

First we define several quantities that correspond to the 
“stability” of the system and are used throughout the proof. 

M . _ _ _ 

Let A{z) = I — X] c)z^ and A(z) = I — zA where A 

i=l 

is the system matrix for the stacked state in companion form. 
That is, 


/A 

P 2 (A,c) ... 

Pm-i(A,c) Pm(A,c) \ 


I 

0 

0 

0 


0 

I 

0 

0 

, ( 20 ) 

0 

0 

I 

0 ) 



x.[k] = Ax.[k — 1] + w[/c] where x[/c] = (x[/c]^ ... x.[k — 
M]^)^ and w[k] = (w[/c]^ 0^ ... 0^)^. Then define 

/imax(^) = max ||A(2;)||2 
Nl=l 

l^miniA) = min ||(. 4 { 2 :))- 1 ||^^ 

kl=l 

Now we proceed with the proof. 


Proof of Lemma 2. To use Propositions 4.1-4.3 from [28], we 
bound the quantities /imin(Al) and p^max{A). Letting B{z) = 
I-A{z), 

M 

max (A) < 1 + max VI ||A ||2 + y^ max V*| ||Pi(A, c )||2 
Nl=i ^ Nl=i 

1=2 

M i 

< 1 + ^ + XIXI 11 ^ 

i=2 j=0 

M i M 

< 1 -h L + L X X I X |Czo| 

i=2 j = l i=2 

< (1 + L + Lp + p) = Q (21) 


by assumption (A5), and similarly 


s/ Mmin ^A) 


min 

kl=i 


{l-B{z))-^ 


-1 

2 


(a) 

- ^ “ “ax|l(^(^))ll 2 > 2 - Q, 

Nl=l 


where the inequality marked (a) is due to assumption (A5) and 
the rest follows from similar logic as (21). Propositions 4.2- 
4.3 hold with high probability when T > doSMN^‘^0-OgM -\- 
logA), where 


ll^wll Mmax(^) 
hmin{A) 


( 22 ) 


Thus, we take uj = uj, and when assumption (A7) holds, 
this condition is satisfied as well. Finally, we substitute these 
upper and lower bounds for p.min{A) and p^max{A) into 
the statements of Proposition 4.1 to yield the statement of 
Lemma 2. □ 


Appendix B 
Proof of Lemma 3 


Lemma 3. Suppose that assumptions (A1)-(A8) hold. Then for 
any t)<P<v <1/2, and sm < d^K/log n and X 2 > Qi, 
the solution to ( 12 ) satisfies 

-P(||c-c||i > (5c) < Ec 

where 


5c = 645mA2/q^i 

and 

Sc =£Re + ^A + exp{-d4A:^“^^} 

+ ^De + ( 6 Af + 1 ) OX.p{ — dfT^ 
where Srq and Sdc defined in Propositions 3.C and 3.D 
found in subsections C-D, respectively, sa is defined in 
Lemma 2, and ai and qi are given as 

ai = kbNT/{ 2K‘') + SReK{tr{-Eo) + GVN) 
qi = y/NtZnaugiQ) + 5 j<m{.5p^^u, 

where 

u = 2 T^“^v^ 7 rcr„ 5 (Q) 

+ {L+5AL{5A))MpK^-\tr{i:o) + GW) 
and Src = 5ALM{,5A)n ^1 + i) + 5aLm{,5a/- 


We see that the interpretation of this lemma is similar 
to that of Lemma 2, so that taking A 2 = Qi is suffi¬ 
cient to achieve the performance in the lemma. We know 
that = ©(A/logTV/AT). Then with this choice of A 2 , 
we have ai = Q{NK^-^ + CVNKSRe) = e{NK + 
GVNK5 a) = eiNK^-^' + ^ {N log N)K) = Q{NK'^-^) 
anci u = Q{'/NK^~^) (from the seconci term) so that qi = 
e{VNK^-l^ + ^ {N log N)/Ku) = 
thus (5c = 0{qi/ai) = O (^^^logN/ Also, note 
that 

5c ^2 exp{—0(A^“^^)} + di exp{—0(A)} 

+ (12M + 2) exp{-0(Ti“2/3)} 

~2exp{-0(Ki-2i^)} + (12M + 2) exp{-0(Ti“^'^)} 

~ 2 exp{- 0 (Ki“ 2 (^)}, 

since is the slowest growing exponent. 

To prove this lemma, we need two intermediate results 
showing that 1 ) a restricted eigenvalue (Re(a,r)) condition 
holds with high probability; and 2) a deviation (De(g)) condi¬ 
tion holds with high probability. These two conditions, which 
will be described shortly, describe the geometric properties 
of the objective function in the sparse optimization problem. 
Together, these conditions imply the desired result. These 
conditions are commonly encountered in sparse estimation, 
and the proof follows closely [28]. 
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A. Eigenvalue and Deviation Conditions 

We begin with the statements of the two conditions. Let 
n = (M — 1)(M + 4)/2, the length of c. 

The Re(a,r) condition is satisfied for T G if for all 

vectors 0 

9^re>a\\9g-T\\9\\l (23) 

In sparse estimation, this condition is used to show curvature 
of the objective function is large enough near the true value of 
the parameter in sparse directions [38]. In other words, large 
perturbations in the objective function correspond to small 
perturbations in the parameter estimate. 

The De(g) condition is satisfied for (r, 7 ) xW^ if 

for the true value of the parameter c, 

|| 7 -rc||oo <g. (24) 

In sparse estimation, this condition intuitively states that 
the gradients of objective function are small near the true 
parameter value [38]. In other words, the estimated parameter 
value that minimizes the objective function is near the true 
parameter value. 


B. Discussion 

For our problem, we wish to show that Re(Q(i, r) and De(gi) 
are satisfied for (r, 7 ) = ^B(A)^B(A), B(A)^Y(A)^ with 
high probability. If these two conditions hold for these matri¬ 
ces, then the estimate c will satisfy ||c — c||i < QAsm^ 2 /oli 
for A 2 >gi. 

Usually, in performance analysis of sparse estimation, 
these two restricted eigenvalue and deviation conditions 
would be shown using the true data (i.e., in our prob¬ 
lem, that Re(Q(o,T) and De(go) are satisfied for (r, 7 ) = 
(B(A)^B(A),B(A)^Y(A))). Matrices B(A) and Y(A) 
have entries that can be described by a multivariate Gaussian 
variable, so showing the two conditions on (r, 7 ) is relatively 
straightforward. However in our two-stage estimation, we 
can only use the estimated parameter A instead of the true 
parameter A. Thus, the challenge in showing that these two 
conditions are s^isfied comes from only being able to use 
the matrices B(A) and Y(A) that are complicated random 
functions of the true parameters and data, and are no longer 
described by a multivariate Gaussian variable. 

Our approach is to show that if the two conditions are 
satisfied on the matrices B(A) and Y(A), then the two 
conditions will also be satisfied on the matrices B(A) and 
Y(A) that are actually available at the second stage. Thus, we 
rely on two additional intermediate results, which we prove 
first, showing that these two conditions are indeed satisfied 
on the matrices B(A) and Y(A) with high probability. We 
additionally point out that although the problem of estimating 
c is overdetermined, it may still be ill-conditioned due to the 
possibly highly correlated form of the data matrices B(A) 
and Y(A). Since it is not immediately obvious that the 
performance would be good, it is important to still demonstrate 
that these conditions hold. 

To summarize the above discussion, for our problem we 
first show that Re(Q(o,T) and De(go) are satisfied for (r, 7 ) = 


(B(A)^B(A),B(A)^Y(A)), and then show that this im¬ 
plies that Re(Q(i,T) and De(gi) are satisfied for (r, 7 ) = 
^B(A)^B(A), B(A)^Y(A)y We show Re(Q(i,T) in Propo¬ 
sition 3.C and De(gi) in Proposition 3.D. 


C. Proof of Supplementary Proposition 1 

Proposition 3.C. Suppose that assumptions (A1)-(A8) hold. 
Then for any 0 and any 0 < z/ < 1 /2, 

P{\\B{A)9\\l<ao\\9\\l-r\\9\\l)<sne 

where r = d^ao{l /{2{logn)n‘^NT^), ao = 

nBNT/2, and SRe = 2exp{-d^n‘^T/{{I ^ 

That is, Re{ao^r) holds for estimating c using B(A)^B(A) 
with probability at least 1 — erq. 


Proof This proof follows loosely the proof of Proposition 4.2 
in [28], first bounding the quadratic form of B(A)^B(A) 
using the Hanson-Wright concentration inequality followed by 
a discretization argument. 

LetD(A,0) = (Do(A,6')'^ Iii{A,0V - Dk-m-i(A, 0)'")'" 
where the block rows of V{A^0) are 

'Di{A,0) = {Onxni P2{A,0) ... Pm{A,0) 0N X N{K-M-2-^) ). 

Then consider any vector || 6 >|| < 1, 


||B(A)0||i = ^ 


y~'^ Pi{A, 6)x[k — i] 


(25) 


= ||P(A,6l)x[0:A-: 


Hi, 


where x[ 0 : A —3] = ( x[ 0 ] 


c[K-3] 


. By an ex¬ 


tension of Gershgorin’s Circle Theorem to block matrices [39], 

M M i 

mA,9)\\2 < ^||P,(A,0)||2 < ^ 


i=2 


i=2 h=0 


Mi Mi 

<y^^i%iiiA^ii2<(i+L)y^^i%i 

i=2 j=0 i=2 j=0 

= (i + i:)||0||i<(i + i:)v^||0||2 
^ (1 + D)\/n. 


Now, 

||B(A)0||2-E[||B(A)0||2] 

= \\V{A,9M0:K-3]\\l--E[\\V{A,9)^[0:K-3]\\l] 


< \\V{A,9)\\l\\^o\\NKr] < {1 + LfnGNKr] (26) 

with probability at least 1 — 2 exp{—d 4 NK min(ry, for 
some universal constant ^ 4 , where (a) follows from the 
Hanson-Wright inequality [40]. Then from the discretization 
argument of Lemma F.2 [28], for an integer 5 > 1 (to be spec¬ 
ified later) and set IC 2 S = {0 : ||6>|| < 1, ||6>||o < 25 }, 

p( sup ||B(A)0||i - E [||B(A)0||i] > (1 + LfnNKGi^ 

\9eK:2s ) 

< 2 ex.p{—d 4 NK min( 7 , 77 ^)+ 25 min(logn, log(21en/25))}. 
Finally, from Lemmas 12 and 13 in [38] and taking 
5 = \d4NKp‘^/{4:\ogn)] and 7 = /^b^2"/[54(1 + 

L)‘^nGNK^] < 1 with 0 < u < 1/2 so that 

7 ^ < 7 where n^NT is the minimum singular value of 
E[B(A)^B(A)] as defined in (A 8 ), we have B(A)'rB(A) 
satisfies Re{K^NT /( 2 A^), r^NT / {2sK^)) with probability 
at least 1 - 2 exp{-d 4 /^^T/((l + LYn^G‘^K‘^^)}. □ 
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D. Proof of Supplementary Proposition 2 

Proposition 3.D. Suppose that assumptions (A1)-(A8) hold. 
Then for any 0 < < 1 /2, 

P(||B(A)^e||oo > go) < 

where go = 2LT^~^ ^/NtNT:(Jug{Q) cmd sdc = 

6M exp{ —d 4 T^~‘^^ }. 

That is, De{qf) holds for estimating c using Y(A)B(A) with 
probability at least 1 — e De- 

Proof The proof is a straightforward application of Propo¬ 
sition 2.4 from [28] and the union bound (as in the proof of 
Proposition 4.3). Letting e = Y(A)—B(A)c and T = K — M, 

K-l \ 

^x[fc -i]'^(AO'^w[fc] I 

/(P=M / 

< max \tr ({A^)^w[M: K—l]x.[M— i: K— 

l<i<M I V /I 

1< 3 <i 

< max T||A^'||i||w[M:A-l]x[M-i:A-l-i]'^/r|| 

1 < i < M II 11OO 

l<j<i 

< T max VLv||A-^||f (^Trcrn i^ 

"V V Tmin{A) J 7 

^ ^ .• 

< 27 LvT max yN\\A ^\\2 (TTaug{Q)r)) 

l<j<M 

< 2LT V NtNTi(Jug{Q)g 

with probability at least 1 — QMeyip{—dfT min( 7 ^, 
where is the maximum sparsity of A^ as defined in 
assumption (A5), (a) is implied by Proposition 2.4 from [28] 
and ( 6 ) is implied by the analysis in Lemma 2. To finish the 
proof, we take p = T~^ for any 0 < < 1 / 2 , noting that for 

this choice, < p. □ 


B(A)^e 


max 

l<i<M 
1< j <i 


E. Proof of Lemma 3 

Armed with these two results, we resume with our lemma. 

Proof From Lemma 2, we have that P( 11 A—A11 2 > • 

Rrst, we consider the Re condition. Let = ai — ao and 
B = B(A)^B(A) - B(A)^B(A). Then, 

p(||B(A)0||i<ai||0||i-r||0||?) 

= p(||B(A) 0||2 < (ao+Ac)|10||"-r||0||?f|||P||2<A„) 

+ P (||B(A)0||^ < ai||0||^ - r\\e\\i f| ||B ||2 > A„) 

< P (\\B{A)0\\l<ao\\0\\l+e^B0-T\\e\\l f| IIPII 2 < A„) (27) 
+ P(||P||2> Ac) 

< P {\\BiA)0\\l < aoirnl - r\\0\\i) 

+P (||P||2> Afl ||A - A||2<5 a)+P (||A - A||2><5a) 

^ SRe + £A + p(||P||2> Acf|l|A-A||2<5A). 

We bound the last probability by manipulating the first term, 
IIBII 2 = ||B(A)^B(A) - B(A)^B(A)||2 

< ||B(A)^B(A)-B(A)^B(A)||2 
+ ||B(A)^B(A)-B(A)^B(A)||2 

< |1B(A)-B(A)||2 (2||B(A)||2 + ||B(A)-B(A)||2) . 

(28) 


Under the constraint || A—A ||2 < (JA^for any vector || 0||2 < 1 
following from the same logic as (26), 

||(B(A)-B(A))(?||i = ||(P(A,d)-P(A,d))x[0:A-3]||i 
< (<5aLm(<5a))L||X||| 


and ||B(A)||i<(l+i:)2n||X|||.. (29) 

Thus, letting 

^Re = ^A^m(^a)^ (2(1 + L) + SaLm{^a)^ , 
we have 


\\B\\2 < 5Re\\X\\l ( 30 ) 

^p(||P||2>A„p|||A-A||2<<5A)<P(||X|||>Ac/<5Re) <£i3, 
where by the Hanson-Wright inequality [40], taking Ac^ = 
6ReK^~^{tr{T,o) + G\fN), we have es = exp{— 

Finally, 

p(||B(A)6l||i <ai||6l||i-T||6l||f) <£Re + eA+£B. ( 31 ) 

Second, letting e = Y(A)—B(A)c and e = Y(A)—B(A)c 
and Ag = gi — go, we consider the De condition. 

P (||B(A)'^e||^ > gi) < P (||B(A)^e||^ > go) 

+ p(||B(A)'^e-B(A)'^e|| > A,) 

<eDe + P (||(B(A)'^ - B(A)'")e||^ > A,,) 

+ p(||B(A)^(e-e)||^>A,,), 
where Aq = Aq^ + Aq ^. We examine the term 

(B(A)-B(A))^e 


= max 

l<i<M 

= max 

l<j <i 




x[/c — iy {A^ — A'^')^w[/c] 


k=i 


(32) 


tr ((AJ-A^yw[M:A:-l]x[M-z:A:-l-z]’^) 


<SaLm{Sa)V^^^^E max w[M : K-l]x.\M-i: K-l-iy 

l<i<M 


< 25aLm{5a)R~^N tNT^<Tu9{Q) 
with probability at least 1 — QMeyip{—dfT^~‘^^}, where 
is defined in (A 6 ), and in (a) we again invoke Proposition 2.4 
from [28] similarly as in Proposition 3.D. 


To finish, we see that 

M m 

w[fc] -w[k] = ^2 y^c^t(A^-A^)x[A;-TO] 

m=l ^=1 


and that ||A ^'||2 < ||A^' + (A^ - AOII 2 <L + 5aL{Sa). 
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These imply, 
B(A)^(e-e) 


< max 

l<i<M 
(a) 


I CX) 

fT-l-H 


—i]^(A'^')^(w[/c] — w[, 


k=i 


<{L + 5aL{Sa))SaL{Sa)Vn 


M 

X max 

m=l 


rr-i-H 


||cm- \\x.[k-i]'^x[k-m] 


k=i 


<{L + 5aL{Sa))SaL{Sa)VnMp\\X\\% 

< {L + SAL{SA))SAUSA)VNMpK^-^{ti{'Eo) + GVN) 
with probability at least 1 — exp{—where (a) 
follows from similar logic as (32) and ( 6 ) from similar logic 
to (30). 

Finally, we arrive at 

Aq=SALM{SA)VNu, 

where 

u — 2 T^ytN 7 Tcrug{Q) + {LS aL(Sa)) M pK(tv(Yjo) + G^/N) and 
p(||B(A)^e| ^ Qi^ < SDe-^ -\-l) expl—d^T} since 

T < K. □ 


Appendix C 
Proof of Theorem 1 

Finally, with the two lemmas in hand, we return to the proof 
of the main theorem, which we restate for convenience. 

Theorem 1. For any 0 < /3 < a < 1/2, and some universal 
constant di, assumptions (A1)-(A8) are sufficient for the error 
e in (16) to satisfy 

e < [Sa (l + {p + 5c)Lm(5a)) + (1 + L)5c)%r(So)/iV (33) 
with probability at least 1 — ~ where 

9 

Lm(o) = max --- 

l<i<M 5 

SA ~ di exp{-0(isr)} 

Ec ~ 2exp{-0(A^“^'')} 

(5a = O (yiogiV/A) 

S^ = 0 (fog N/K'^G-d) 


(34) 


Proof of Theorem 1. First, applying the union bound to results 
from Lemmas 2 and 3, we see that 

P - A ||2 < <5Aj Pi (||c - c||i < <5c)) > 1 - EA - ffc- 

Thus, we proceed using ||A — A ||2 < and ||c — c||i < ^c- 
Then, ||A ||2 < ||A + (A-A )||2 < L + ^a- Similarly, ||c||i < 
p + ^c- Then, 

max ||A*-A *||2 

l<i<M 


< IIA — AII2 max 


^A^A 

i=o 




< ^A max (L + SaV^ 

j=0 

< SaLm{Sa)‘ 

Note that Lm{S) = 0{ML^~^) when (5^0. 

Next, dropping from the list of arguments for the function 
/(A,c,X'^_^) defined below equation (16) the explicit de¬ 
pendence on for compactness of notation, 

E[||x[fc]-/(A,e)ll2] -E [||x[fc]-/(A,c)||^] 

(/(A,c)-/(A,c)) (2x[k]-f{A,c)-f{A,c)J 

(/(A,c)-/(A,c)) (2w[k]+f{A,c)-f{A,c)'j 


= E 




= E 
= E 
< E 

(b) 


||/(A,c)-/(A,c)||^^ 
||/(A,c)-/(A,c)+/(A,c)-/(A,c)||^] 

A,c)-/(A,c)||2 + ||/(A,c)-/(A,c)||2)' 

< (||D(A,e)-D(A,c )||2 + ||D(A,c)-D(A,c)|| 2 )'E [||X',_i||^] 
= ( ||D(A,c)-D(A,c )||2 + ||D(A,c-c)|| 2 )'Mtr(So), 

Vi V 2 

where D(A',c') = ( A' Pi(A',c') ... Pm(A',c') ), 

and where (a) is due to x[/c] —/(A, c) = w[k] P x[ji] Vj < k, 
and (b) is due to D(A', c')X'^_^ = /(A',c'). Now consider 
the term Vi, 

M M i 

Fi < ^||P,(A, c) -P4A, c) II 2 < 1% III- A^' II 2 

i=l i=2j=l 

<^A + ^a^m(^a)||c||i < Sa (^1 F {p F Sc)Lm{Sa)^ • 

Next we similarly bound 1 ^ 2 , 

M M i 

^2 < E ll^^(A,c - c )||2 < EE 1% - 


M 


i=2 j=0 


M i 


< El^^O +EEl% -Qill|A^||2 

i=2 i=2 i=l 

(a) 

P ||c — c||i + P||c — c||i < (1 + L)^c- 
where (a) follows from ||c — c||i < Sc proven in lemma 3, 
and the result follows. □ 


< ^A max y^ A 
0<i<M-l II 

j=0 


iiAiir 


(35) 



